Word-Streams for Representing Context in Word Maps

Author

  • Arnulfo P. Azcarraga
Abstract

The most prominent use of Self-Organizing Maps (SOMs) in text archiving and retrieval is the WEBSOM. In WEBSOM, a map is first used to reduce the dimensionality of the huge term frequency table by training a so-called word-category map. This word-category map is then used to convert the individual documents into their respective document signatures (i.e. histograms of words), which form the basis for training a document map. This document map is the final text archive. WEBSOM has been shown to be a powerful and versatile text archiving system. However, it spends enormous computer resources on the computation of the left and right context of every word that appears in any of the documents in the text corpus. This paper presents an alternative scheme for incorporating context in the encoding of the words, one that takes full advantage of the computation of the probabilistic centroid that is inherent in the SOM training algorithm. Several experiments are conducted to compare this new scheme with WEBSOM's context averaging scheme.

1 Digital Libraries and Text Archives

The Information Age ushers in an ultra-modern society where every bit of information is in digital form and within electronic reach. The wiring of society, of which the World-Wide-Web is an early manifestation, will pave the way for the interconnection of humanity, allowing anyone to exchange information with everybody else. A critical component of such a wired-up world is the archiving of information in a way that allows for efficient and convenient retrieval. Implementations of such archives, loosely referred to as "digital libraries", have become commonplace. Though digital libraries are designed to contain various types of objects, such as pictures, diagrams, satellite images, medical radiographs, magnetic resonance images, music and sound files, video clips, and even full movie videos, the bulk of these libraries are concerned with digitized text documents. Understandably, then, text archiving and retrieval has received enormous attention as an area of research.

Techniques vary widely in the specific processes involved in text archiving and retrieval, and there has been considerable research progress in the various approaches to feature extraction and document classification. These techniques span the fields of Artificial Intelligence, Statistical Pattern Recognition, and Artificial Neural Networks. Yang [1] recently compared various general methods for text classification, and benchmark cases are now available for comparing new methods with older ones. The conditions for text classification have drastically changed over the years, which explains why newer methods are continually being considered despite the relative efficiency of the older methods and the fact that these are already well understood. Because of the Web, and because personal computers and other widely available computing devices have become so powerful, digital libraries have ventured into applications that were not feasible, or not even thinkable, just a decade ago. The prospect of archiving subsets of the Web is an example. Web-based text archives being considered nowadays are no longer collections of hundreds or thousands of pages; these collections now run to the order of millions of documents. The number of distinct words (terms) that appear at least once in the text corpus, even after removal of very common words (stop words), runs from tens of thousands to over a hundred thousand.
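To make that scale concrete, the following sketch builds the kind of term-frequency table whose dimensionality WEBSOM's word-category map is meant to reduce: every document becomes a vector with one component per distinct term. The toy corpus, stop list, and names are invented for illustration and are not from the paper.

```python
from collections import Counter

# Illustrative only: a toy corpus and stop list (not from the paper).
corpus = [
    "self organizing maps organize documents on a two dimensional grid",
    "the document map places similar documents near each other",
    "word category maps reduce the dimensionality of the term table",
]
stop_words = {"a", "the", "of", "on", "each", "other"}

# Vocabulary = every distinct term that appears at least once, after
# stop-word removal. In a real Web-scale corpus this runs from tens
# of thousands to over a hundred thousand terms.
vocabulary = sorted({w for doc in corpus for w in doc.split()} - stop_words)

# Term-frequency table: one row per document, one column per term.
# Each document is thus a |vocabulary|-dimensional vector, the huge
# sparse representation that motivates dimensionality reduction.
def term_frequencies(doc):
    counts = Counter(w for w in doc.split() if w not in stop_words)
    return [counts[term] for term in vocabulary]

table = [term_frequencies(doc) for doc in corpus]
print(f"{len(corpus)} documents x {len(vocabulary)} terms")
```

With millions of documents and a vocabulary in the hundreds of thousands, such a table is far too large to use directly, which is what makes the dimensionality-reduction stage indispensable.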
Many techniques that proved effective in the past have been rendered obsolete by the drastic increase in the size of the text archives being considered and in the number of distinct words that appear in the corpora. More and more, statistical approaches to document classification are gaining ground. Statistical techniques do not make intricate analyses of words and their synonyms, and rarely depend on human intervention for removing words that contribute little to the eventual classification of the documents. Instead, statistical methods mainly use the relative frequency of appearance of words in the various documents.

The field of Artificial Neural Networks is a collection of models and techniques that are tightly related to statistical pattern recognition. Among the Neural Network models that have found applications in document classification, and especially in text archiving and retrieval, is the so-called "Self-Organizing Map" (SOM). Section 2 of this paper discusses the general characteristics of Self-Organizing Maps and the various ways SOMs are used in the different stages of text classification. WEBSOM is discussed in detail in Section 3, followed by some discussion of where and why the WEBSOM could be further improved. Section 4 discusses in detail the use of SOMs at the word map level, and proposes a new way of representing context in order to achieve the same kind of word-category maps reported in WEBSOM. Section 5 presents experimental results which indicate how the alternative scheme could produce word maps comparable to those produced by WEBSOM.

2 SOMs in Text Archiving Systems

Self-Organizing Maps are typically 2D maps structured as either rectangular or hexagonal (honeycomb) lattices. SOMs are trained using data with no information as to the categories to which each training pattern belongs. Once training is completed, the SOM is labeled and clusters are identified which reflect the relationships among the patterns in the input environment. Abundant literature on SOMs exists, including a large number of reported applications; some introductory materials can be found in [2][3][4].
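The following is a minimal sketch of the training regime just described: unsupervised updates in which each input pulls its best-matching unit, and that unit's lattice neighbours, toward itself. The lattice size, decay schedules, and parameter values here are illustrative assumptions, not those of WEBSOM.

```python
import numpy as np

rng = np.random.default_rng(0)

def train_som(weights, grid, data, epochs=20, lr0=0.5, radius0=5.0):
    """Unsupervised SOM training: no category labels are used."""
    for epoch in range(epochs):
        # Learning rate and neighborhood radius both shrink over time.
        frac = epoch / epochs
        lr = lr0 * (1.0 - frac)
        radius = radius0 * (1.0 - frac) + 1.0
        for x in rng.permutation(data):
            # 1. Find the best-matching unit (closest reference vector).
            dists = np.linalg.norm(weights - x, axis=2)
            bmu = np.unravel_index(np.argmin(dists), dists.shape)
            # 2. Pull the BMU and its lattice neighbours toward the input;
            #    a Gaussian kernel over map distance moves nearby units most.
            d2 = np.sum((grid - np.array(bmu)) ** 2, axis=2)
            h = np.exp(-d2 / (2.0 * radius ** 2))
            weights += lr * h[..., None] * (x - weights)
    return weights

# A 10 x 10 rectangular lattice of 16-dimensional reference vectors
# (illustrative sizes; real word or document vectors are much larger).
rows, cols, dim = 10, 10, 16
weights = rng.random((rows, cols, dim))
grid = np.dstack(np.meshgrid(np.arange(rows), np.arange(cols), indexing="ij"))

train_som(weights, grid, rng.random((200, dim)))  # toy training data
```

After training, inputs that are similar in the data space end up mapped to nearby lattice units, which is the self-organizing property that both the word-category map and the document map rely on.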
The most prominent use of SOMs in text archiving and retrieval has been the WEBSOM [5][6][7], whose general architecture is shown in Figure 1. The WEBSOM uses self-organizing maps in two ways. SOMs are first used to reduce the dimensionality of the huge term frequency table by training a word-category map. This word-category map is then used to convert the individual documents into their respective document signatures (i.e. histograms of words), which form the basis for training a document map. The document map is the final text archive. It organizes documents on a simple 2D hexagonal grid in such a way that similar documents are found near each other in the map. This feature allows for versatile document retrieval and paves the way for other creative uses of the document map, such as document distribution and filtering.

Each document is typically preprocessed to extract the main text and to remove various other markings such as company logos, e-mail headers and signatures (in the case of e-mail documents), figures, equations, etc. The main text is then reduced to a simple list of words, from which very common words appearing in a so-called "stop list" have been removed. Further stemming of the words is also done (as in the WEBSOM's archiving of a collection of documents written in Finnish), so that words that share the same root but differ only in suffixes, inflections, etc. are encoded as a single word.
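A minimal sketch of this preprocessing stage is given below. The stop list and the crude suffix-stripping stemmer are invented stand-ins for illustration; WEBSOM's actual stemming of Finnish is considerably more involved.

```python
import re

# Illustrative stop list and suffix rules -- stand-ins for the real ones.
STOP_WORDS = {"the", "a", "of", "and", "to", "in", "is", "are"}
SUFFIXES = ("ing", "ed", "es", "s")   # crude stand-in for a real stemmer

def stem(word):
    """Strip one common suffix so that inflected forms share one code."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def preprocess(document):
    """Reduce a document's main text to a clean list of word stems."""
    words = re.findall(r"[a-z]+", document.lower())
    return [stem(w) for w in words if w not in STOP_WORDS]

print(preprocess("The maps organized similar documents into clusters."))
# -> ['map', 'organiz', 'similar', 'document', 'into', 'cluster']
```

The resulting word lists are what the word-category map encodes; the histogram of a document's words over the trained map units then serves as that document's signature.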


Publication date: 2000